[Hadoop] Configuring LZO Compression Support in Hadoop

Installing lzop, configuration, and testing

Posted by 李玉坤 on 2017-08-20

Introduction

Installing LZO
LZO is not natively supported by Linux, so the software packages need to be downloaded and installed. At least three packages are required: lzo, lzop, and hadoop-gpl-packaging.

Adding an index
The main purpose of hadoop-gpl-packaging is to create an index for compressed .lzo files. Without an index, an LZO file is not splittable: no matter how much larger it is than the HDFS block size, it is processed as a single input split by default.
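The effect on parallelism can be sketched with a bit of arithmetic. This is only a sketch; the ~32.8 MB file size and 10 MB block size are taken from the test later in this post:

```shell
FILE_BYTES=34428125                # ~32.8 MB .lzo file (size from the test below)
BLOCK_BYTES=$((10 * 1024 * 1024))  # 10 MB HDFS block size (as configured later in this post)

# Without a .index file, LZO is not splittable: the whole file is one split.
NO_INDEX_SPLITS=1

# With a .index file, the file splits on roughly block-sized boundaries:
INDEXED_SPLITS=$(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))

echo "without index: $NO_INDEX_SPLITS split; with index: $INDEXED_SPLITS splits"
```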

Installing the lzop native library

Install the build dependencies
[root@hadoop etc]# yum -y install lzo-devel zlib-devel gcc autoconf automake libtool
Download lzo
[hadoop@hadoop software]$ wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
--2019-04-14 16:02:12-- http://www.oberhumer.com/opensource/lzo/download/lzo-2.06.tar.gz
Resolving www.oberhumer.com (www.oberhumer.com)... 193.170.194.40
Connecting to www.oberhumer.com (www.oberhumer.com)|193.170.194.40|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 583045 (569K) [application/x-gzip]
Saving to: “lzo-2.06.tar.gz”

100%[============================================================>] 583,045 173KB/s in 3.3s

2019-04-14 16:02:22 (173 KB/s) - “lzo-2.06.tar.gz” saved [583045/583045]

[hadoop@hadoop software]$
[hadoop@hadoop software]$ tar -zxvf lzo-2.06.tar.gz -C ../app/
Enter the source directory and build
[hadoop@hadoop lzo-2.06]$ export CFLAGS=-m64
[hadoop@hadoop lzo-2.06]$ ./configure --enable-shared --prefix=/home/hadoop/app/lzo/
[hadoop@hadoop lzo-2.06]$ make && sudo make install
After the build completes, files are generated under /home/hadoop/app/lzo/.
Pack up everything under /home/hadoop/app/lzo/ and sync it to every machine in the cluster.

Installing hadoop-lzo

Download hadoop-lzo
[hadoop@hadoop software]$ wget https://github.com/twitter/hadoop-lzo/archive/master.zip
Unzip
[hadoop@hadoop software]$ unzip master.zip -d ../app/
Enter the unpacked directory
[hadoop@hadoop app]$ cd hadoop-lzo-master/
[hadoop@hadoop hadoop-lzo-master]$

Because this cluster runs Hadoop 2.6.0, change the Hadoop version in pom.xml to 2.6.0:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <hadoop.current.version>2.6.0</hadoop.current.version>
  <hadoop.old.version>1.0.4</hadoop.old.version>
</properties>

[hadoop@hadoop hadoop-lzo-master]$ export CFLAGS=-m64
[hadoop@hadoop hadoop-lzo-master]$ export CXXFLAGS=-m64

Adjust these to the actual paths of your own installation
[hadoop@hadoop hadoop-lzo-master]$ export C_INCLUDE_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/include
[hadoop@hadoop hadoop-lzo-master]$ export LIBRARY_PATH=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/lzo/lib
[hadoop@hadoop hadoop-lzo-master]$
Build with Maven
[hadoop@hadoop hadoop-lzo-master]$ mvn clean package -Dmaven.test.skip=true

[INFO] Building jar: /home/hadoop/app/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8:43.929s
[INFO] Finished at: Sun Apr 14 16:42:02 CST 2019
[INFO] Final Memory: 25M/61M
[INFO] ------------------------------------------------------------------------
[hadoop@hadoop hadoop-lzo-master]$
Go into the target directory
[hadoop@hadoop hadoop-lzo-master]$ cd target/native/Linux-amd64-64/
[hadoop@hadoop Linux-amd64-64]$ pwd
/home/hadoop/app/hadoop-lzo-master/target/native/Linux-amd64-64
[hadoop@hadoop Linux-amd64-64]$
[hadoop@hadoop Linux-amd64-64]$ mkdir ~/app/hadoop-lzo-files
[hadoop@hadoop Linux-amd64-64]$ tar -cBf - -C lib . | tar -xBvf - -C ~/app/hadoop-lzo-files
[hadoop@hadoop hadoop-lzo-files]$ cp ~/app/hadoop-lzo-files/libgplcompression* $HADOOP_HOME/lib/native/
Note!!! The files copied in this step must also be synced to the corresponding Hadoop directory on every other host in the cluster.

Copy the hadoop-lzo jar into the Hadoop directory

[hadoop@hadoop target]$ pwd
/home/hadoop/app/hadoop-lzo-master/target
[hadoop@hadoop target]$ cp hadoop-lzo-0.4.21-SNAPSHOT.jar $HADOOP_HOME/share/hadoop/common/
Note!!! The jar copied in this step must also be synced to the corresponding Hadoop directory on every other host in the cluster.

Add the following to Hadoop's core-site.xml

vim core-site.xml
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,
org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Add the following to mapred-site.xml

vim mapred-site.xml

Compression of intermediate map output
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

Compression of the final job output
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
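Note: the mapred.* names used for the map-output settings above are the legacy property names; Hadoop 2.x still honors them but logs deprecation warnings. The current equivalents are:

```xml
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```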

Note: the modified core-site.xml and mapred-site.xml must also be synced to the corresponding directories on the other hosts in the cluster; **finally, restart the cluster!**

(Tip: use hadoop fs -du -s -h <path> to check a file's size.)

Verifying that it works

Test a MapReduce job with LZO input and bzip2 output


In the data-cleansing MapReduce code, add the output compression settings
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

job.setInputFormatClass(LzoTextInputFormat.class);

You also need to add the Maven dependency:

<!-- https://mvnrepository.com/artifact/com.hadoop.gplcompression/hadoop-lzo -->
<dependency>
  <groupId>com.hadoop.gplcompression</groupId>
  <artifactId>hadoop-lzo</artifactId>
  <version>cdh4-0.4.15-gplextras</version>
</dependency>

Package it as hadoop_train-lzo-bzip2.jar

Because I changed the default block size to 10 MB, prepare a compressed file larger than 10 MB:
[hadoop@hadoop data]$ hadoop fs -du -s -h /input/lzo/test.log.lzo
32.8 M 32.8 M /input/lzo/test.log.lzo

Use the hadoop-lzo jar to generate an index file for the .lzo file
[hadoop@hadoop target]$ hadoop jar hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /input/lzo/test.log.lzo/
19/04/16 17:10:49 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/16 17:10:49 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/16 17:10:51 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /input/lzo/test.log.lzo, size 0.03 GB...
19/04/16 17:10:51 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
19/04/16 17:10:52 INFO lzo.LzoIndexer: Completed LZO Indexing in 1.00 seconds (32.97 MB/s). Index size is 8.84 KB.

[hadoop@hadoop target]$ pwd
/home/hadoop/app/hadoop-lzo-master/target
[hadoop@hadoop target]$

[hadoop@hadoop target]$ hadoop fs -ls /input/lzo/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 34428125 2019-04-16 16:30 /input/lzo/test.log.lzo
-rw-r--r-- 1 hadoop supergroup 9056 2019-04-16 17:10 /input/lzo/test.log.lzo.index


Running the MapReduce job, you can see number of splits:4 (the 32.8 MB file spans four 10 MB blocks)
[hadoop@hadoop data]$ hadoop jar hadoop_train-lzo-bzip2.jar com.kun.hadoop.mapreduce.driver.LogETLDriver /input/lzo/ /output/lzo/
19/04/16 19:11:44 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/04/16 19:11:46 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
19/04/16 19:11:46 INFO input.FileInputFormat: Total input paths to process : 2
19/04/16 19:11:46 INFO mapreduce.JobSubmitter: number of splits:4

Job Counters
Launched map tasks=4
Launched reduce tasks=1
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=427208
Total time spent by all reduces in occupied slots (ms)=39870
Total time spent by all map tasks (ms)=427208
Total time spent by all reduce tasks (ms)=39870
Total vcore-seconds taken by all map tasks=427208
Total vcore-seconds taken by all reduce tasks=39870
Total megabyte-seconds taken by all map tasks=437460992
Total megabyte-seconds taken by all reduce tasks=40826880


Inspect the output files
[hadoop@hadoop target]$ hadoop fs -ls /output/lzo/
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2019-04-16 19:14 /output/lzo/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 5580934 2019-04-16 19:14 /output/lzo/part-r-00000.bz2

Create a test table
hive> create external table bzip2_test(
> cdn string,
> region string,
> level string,
> time string,
> ip string,
> domain string,
> url string,
> traffic bigint
> ) partitioned by (day string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LOCATION '/home/hadoop/data/clear';
OK
Time taken: 0.215 seconds
hive>


Move the data into the external table's directory
[hadoop@hadoop data]$ hadoop fs -ls /home/hadoop/data/clear
[hadoop@hadoop data]$ hadoop fs -mkdir -p /home/hadoop/data/clear/day=20190416
[hadoop@hadoop data]$ hadoop fs -mv /output/lzo/part-r-00000.bz2 /home/hadoop/data/clear/day=20190416
[hadoop@hadoop data]$ hadoop fs -ls /home/hadoop/data/clear/day=20190416
Found 1 items
-rw-r--r-- 1 hadoop supergroup 5580934 2019-04-16 19:14 /home/hadoop/data/clear/day=20190416/part-r-00000.bz2
[hadoop@hadoop data]$

Load the partition
hive> alter table bzip2_test add if not exists partition(day='20190416');
OK
Time taken: 2.956 seconds

Query it
hive> select * from bzip2_test limit 1;
OK
yahu AE W 20190416111803 63.72.55.168 shabi.com - 746411 20190416
Time taken: 1.176 seconds, Fetched: 1 row(s)
hive>

As you can see, Hive can query the bzip2-compressed data.

Testing LZO input on its own

Side note: Hive's intermediate data compression codec can be set with set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.XXX (e.g. SnappyCodec);
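For example, a minimal sketch of the related Hive session settings (the on/off switch hive.exec.compress.intermediate also needs to be enabled for the codec to take effect):

```sql
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```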

Change the code from the previous step, replacing

FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);

with

FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

The job's output serves as the Hive test data below.

Create an LZO-format table in Hive. After loading the data, go into the table's data directory and generate the index file; the table data can then be queried, and aggregation queries also benefit from splitting.

Create the external table
hive> create external table lzo_test (
> cdn string,
> region string,
> level string,
> time string,
> ip string,
> domain string,
> url string,
> traffic bigint
> ) partitioned by (day string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/home/hadoop/data/clear' ;
OK
Time taken: 26.37 seconds
hive>

Move the data into the corresponding directory
[hadoop@hadoop data]$ hadoop fs -cp /output/lzo/part-r-00000.lzo /home/hadoop/data/clear/day=20190416
[hadoop@hadoop data]$ hadoop fs -ls /home/hadoop/data/clear/day=20190416
Found 1 items
-rw-r--r-- 1 hadoop supergroup 16939951 2019-04-16 19:52 /home/hadoop/data/clear/day=20190416/part-r-00000.lzo
[hadoop@hadoop data]$

Go to /home/hadoop/data/clear/day=20190416 and generate the index for the .lzo file
[hadoop@hadoop data]$ hadoop jar ~/app/hadoop-lzo-master/target/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /home/hadoop/data/clear/day=20190416/part-r-00000.lzo
19/04/16 20:15:02 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
19/04/16 20:15:02 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev f1deea9a313f4017dd5323cb8bbb3732c1aaccc5]
19/04/16 20:15:04 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /home/hadoop/data/clear/day=20190416/part-r-00000.lzo, size 0.02 GB...
19/04/16 20:15:04 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
19/04/16 20:15:05 INFO lzo.LzoIndexer: Completed LZO Indexing in 1.12 seconds (14.48 MB/s). Index size is 1.78 KB.

[hadoop@hadoop data]$

Load the partition
hive> alter table lzo_test add if not exists partition(day='20190416');
OK
Time taken: 11.12 seconds
hive> select * from lzo_test limit 1;
OK
yahu AE W 20190416111803 63.72.55.168 shabi.com - 746411 20190416
Time taken: 6.673 seconds, Fetched: 1 row(s)
hive>

Run a MapReduce query
[hadoop@hadoop data]$ hadoop fs -du -s -h /home/hadoop/data/clear/day=20190416/part-r-00000.lzo
16.2 M 16.2 M /home/hadoop/data/clear/day=20190416/part-r-00000.lzo
[hadoop@hadoop data]$
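As a quick sanity check, the expected split count can be computed from the file size above (a sketch, assuming the 10 MB block size configured on this cluster):

```shell
FILE_BYTES=16939951                 # size of part-r-00000.lzo (from hadoop fs -ls)
BLOCK_BYTES=$((10 * 1024 * 1024))   # 10 MB block size

# An indexed .lzo file splits like plain text: ceil(file_size / block_size).
SPLITS=$(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))
echo "$SPLITS"                      # prints 2
```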

You can see the split count is 2 (because my block size is 10 MB)
hive> select count(1) from lzo_test;
Query ID = hadoop_20190416194747_ab77e496-bd78-4aaa-8a32-6bf34e98a2d1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1555410311484_0006, Tracking URL = http://hadoop:8088/proxy/application_1555410311484_0006/
Kill Command = /home/hadoop/app/hadoop-2.6.0-cdh5.7.0/bin/hadoop job -kill job_1555410311484_0006
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

OK
706000
Time taken: 69.802 seconds, Fetched: 1 row(s)